Now that we've learned about dplyr, we can turn to tidyr, a complementary package that helps us create tidy data sets! So what do we mean when we say "tidy data"?
Tidy data is a data set in which every row is an observation and every column is a variable, so that every cell holds the value of one specific variable for one specific observation. Having your data in this format makes it easier to understand and lets you analyze or visualize it quickly and efficiently.
After viewing this lecture, you can reference this handy cheatsheet on Data Wrangling.
install.packages('tidyr',repos = 'http://cran.us.r-project.org')
library(tidyr)
library(data.table)
All data.tables are also data.frames. Loosely speaking, you can think of data.tables as data.frames with extra features.
data.frame is part of base R.
data.table is a package that extends data.frames. Two of its most notable features are speed and cleaner syntax.
However, data.table syntax differs from the standard R syntax for data.frame, and the two are hard for the untrained eye to tell apart at a glance. So if you read a code snippet with no context indicating that it works with data.tables, and you apply that code to a data.frame, it may fail or produce unexpected results.
So what are some of the practical differences? Here are a few:
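As one illustration of such a difference, here is a minimal sketch. The objects df_plain and dt are made up for this example, and it assumes the data.table package is installed.

```r
library(data.table)

# Hypothetical example data: the same values as a data.frame and a data.table
df_plain <- data.frame(a = 1:3, b = c(10, 20, 30))
dt <- as.data.table(df_plain)

# A single index selects a COLUMN of a data.frame ...
col_result <- df_plain[2]   # data.frame holding only column b
# ... but a ROW of a data.table.
row_result <- dt[2]         # data.table holding only the second row

# In a data.table the second argument is evaluated inside the table,
# so column names can be used directly.
mean_b <- dt[, mean(b)]

# The same expression on a data.frame fails here, because b is not a
# variable in the calling environment.
df_attempt <- tryCatch(df_plain[, mean(b)], error = function(e) "error")
```

The same-looking bracket call therefore does different things depending on the class of the object, which is why it helps to check class() before reusing a snippet.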
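We'll cover some of the most useful functions in tidyr, including gather(), spread(), separate(), and unite().
These basically perform the following actions: gather() collapses multiple columns into key-value pairs, spread() spreads a key-value pair across multiple columns, separate() splits a single column into multiple columns, and unite() pastes multiple columns together into one.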

Let's create some fake data that needs to be cleaned using tidyr:
comp <- c(1,1,1,2,2,2,3,3,3)
yr <- c(1998,1999,2000,1998,1999,2000,1998,1999,2000)
q1 <- runif(9, min=0, max=100)
q2 <- runif(9, min=0, max=100)
q3 <- runif(9, min=0, max=100)
q4 <- runif(9, min=0, max=100)
df <- data.frame(comp=comp,year=yr,Qtr1 = q1,Qtr2 = q2,Qtr3 = q3,Qtr4 = q4)
df
Sometimes people like to think of these operations as analogous to pivot tables in Excel. Let's see some examples of how to use them:
The gather() function collapses multiple columns into key-value pairs. The data frame above is considered wide because the time variable (represented as quarters) is structured so that each quarter is its own column. To restructure the time component as a single variable, we gather the quarter names into one column and the values associated with each quarter into a second column.
# Using Pipe Operator
head(df %>% gather(Quarter,Revenue,Qtr1:Qtr4))
# With just the function
head(gather(df,Quarter,Revenue,Qtr1:Qtr4))
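In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). As a rough sketch of the equivalent call, assuming a recent tidyr version and using a small made-up table named sales in place of the random data above:

```r
library(tidyr)

# Hypothetical, fixed-value version of the quarterly revenue table
sales <- data.frame(comp = c(1, 2),
                    year = c(1998, 1998),
                    Qtr1 = c(10, 20), Qtr2 = c(11, 21),
                    Qtr3 = c(12, 22), Qtr4 = c(13, 23))

# names_to plays the role of the key column, values_to the value column
long <- pivot_longer(sales, cols = Qtr1:Qtr4,
                     names_to = "Quarter", values_to = "Revenue")
long
```

One visible difference is row order: pivot_longer() keeps the four quarters of each original row together, while gather() stacks the result one quarter column at a time.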
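The spread() function is the complement of gather(), which is why it's called spread(). It takes a key-value pair of columns and spreads them back out into multiple columns:
stocks <- data.frame(
  time = as.Date('2009-01-01') + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)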
stocks
stocksm <- stocks %>% gather(stock, price, -time)
stocksm %>% spread(stock, price)
stocksm %>% spread(time, price)
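Likewise, spread() is superseded by pivot_wider() in tidyr 1.0.0 and later. A minimal sketch of the equivalent, assuming a recent tidyr version and a small made-up table named prices_long:

```r
library(tidyr)

# Hypothetical long-format data: one row per (time, stock) pair
prices_long <- data.frame(
  time  = rep(as.Date("2009-01-01") + 0:1, times = 2),
  stock = rep(c("X", "Y"), each = 2),
  price = c(1.5, 2.5, 3.5, 4.5)
)

# names_from supplies the new column names, values_from fills the cells
prices_wide <- pivot_wider(prices_long, names_from = stock, values_from = price)
prices_wide
```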
Given either a regular expression or a vector of character positions, separate() turns a single character column into multiple columns. By default it splits on any run of non-alphanumeric characters.
df <- data.frame(x = c(NA, "a.x", "b.y", "c.z"))
df
df %>% separate(x, c("ABC", "XYZ"))
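The two ways of specifying the split can be sketched as follows. The data frames periods and pairs are invented for this illustration: a numeric sep splits at a character position, while a character sep is treated as a regular expression.

```r
library(tidyr)

# Split by position: the first 4 characters go to year, the rest to quarter
periods <- data.frame(period = c("2015Q1", "2016Q3"), stringsAsFactors = FALSE)
by_position <- separate(periods, period, into = c("year", "quarter"), sep = 4)

# Split by regular expression: an escaped literal dot
pairs <- data.frame(x = c("a.x", "b.y"), stringsAsFactors = FALSE)
by_regex <- separate(pairs, x, into = c("ABC", "XYZ"), sep = "\\.")
```

Spelling out sep explicitly is a reasonable habit when the default (any non-alphanumeric character) could split in more places than intended.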
unite() is a convenience function that pastes multiple columns together into one.
head(mtcars)
# unite_() is deprecated in current tidyr, so use unite() with bare column names
unite(mtcars, "vs.am", vs, am, sep = '.')
# Separate is the complement of unite
mtcars %>%
  unite(vs_am, vs, am) %>%
  separate(vs_am, c("vs", "am"))
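One detail of this round trip is that separate() returns character columns by default, so vs and am come back as text rather than numbers. A small sketch of how the convert argument of separate() can restore numeric types (the result names as_text and round_trip are made up here):

```r
library(tidyr)

# Default behaviour: the separated columns are character vectors
as_text <- mtcars %>%
  unite(vs_am, vs, am) %>%
  separate(vs_am, c("vs", "am"))

# convert = TRUE asks separate() to guess the column types again
round_trip <- mtcars %>%
  unite(vs_am, vs, am) %>%
  separate(vs_am, c("vs", "am"), convert = TRUE)

class(as_text$vs)
class(round_trip$vs)
```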
Hopefully you'll find tidyr useful when having to clean up your data!